LLM 25-Day Course - Day 11: Multimodal Models


Multimodal models process not only text but also images, audio, video, and other forms of input simultaneously. Since 2024, most cutting-edge LLMs support multimodal capabilities.

Multimodal Model Comparison (By Model Family)

| Model Family | Input | Output | Features |
| --- | --- | --- | --- |
| OpenAI Multimodal | Text + Image (+Audio) | Text (+Audio) | General-purpose API, rich ecosystem |
| Claude Multimodal | Text + Image | Text | Strong in document/chart interpretation |
| Gemini Multimodal | Text + Image (+Video) | Text | Long context/video capabilities |
| LLaVA/Qwen-VL | Text + Image | Text | Open-source, easy local execution |

Version/model IDs change frequently, so check each provider’s model listing documentation before use.

OpenAI Vision API (Chat Completions Compatible Example)

from openai import OpenAI
import base64

client = OpenAI()

# Method 1: Pass image via URL
response = client.chat.completions.create(
    model="gpt-4o",  # Can be replaced with the latest multimodal model available in your project
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What do you see in this image? Please describe it."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/sample.jpg"},
                },
            ],
        }
    ],
    max_tokens=500,
)
print(response.choices[0].message.content)

# Method 2: Pass local image as base64
def encode_image(image_path):
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_base64 = encode_image("screenshot.png")
response = client.chat.completions.create(
    model="gpt-4o",  # Can be replaced with the latest multimodal model available in your project
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Analyze the error in this screenshot."},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_base64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
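A high-detail image can cost far more tokens than the accompanying prompt text. The sketch below estimates that cost using OpenAI's published tiling rule; the constants (85-token base, 170 tokens per 512x512 tile) apply to gpt-4o-class models and may change, so verify them against the current pricing documentation before relying on the numbers:

```python
import math

def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Rough token estimate for one image sent to a gpt-4o-class vision model.

    Low detail costs a flat 85 tokens. High detail first scales the image to
    fit within 2048x2048, then scales down so the shorter side is at most
    768px, and charges 170 tokens per 512x512 tile plus an 85-token base.
    """
    if detail == "low":
        return 85
    # Fit within 2048 x 2048, preserving aspect ratio (downscale only)
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Shrink so the shorter side is at most 768px (downscale only)
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85

print(estimate_image_tokens(1024, 1024))  # 765 per the documented example
print(estimate_image_tokens(4000, 3000, detail="low"))  # 85
```

Passing `"detail": "low"` inside the `image_url` object is the documented way to cap image cost when fine detail is not needed.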

Claude Vision API

import anthropic
import base64

client = anthropic.Anthropic()

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

image_data = encode_image("chart.png")

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": image_data,
                    },
                },
                {
                    "type": "text",
                    "text": "Analyze the data in this chart and explain the key trends.",
                },
            ],
        }
    ],
)
print(message.content[0].text)
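Unlike the OpenAI data-URL format, the Anthropic API requires an explicit `media_type` field that must match the actual image format. A small helper using Python's standard `mimetypes` module can derive it from the file extension (the filenames here are placeholders):

```python
import mimetypes

def image_media_type(path: str) -> str:
    """Return the media_type string the Anthropic Messages API expects.

    The API accepts image/jpeg, image/png, image/gif, and image/webp;
    anything else is rejected, so fail fast on unsupported extensions.
    """
    media_type, _ = mimetypes.guess_type(path)
    supported = {"image/jpeg", "image/png", "image/gif", "image/webp"}
    if media_type not in supported:
        raise ValueError(f"Unsupported image type for {path}: {media_type}")
    return media_type

print(image_media_type("chart.png"))   # image/png
print(image_media_type("photo.jpg"))   # image/jpeg
```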

LLaVA Local Execution (Open-Source)

# Run LLaVA with Ollama (local, free)
# ollama pull llava:13b
import ollama

response = ollama.chat(
    model="llava:13b",
    messages=[
        {
            "role": "user",
            "content": "Describe this image.",
            "images": ["./photo.jpg"],  # Local image path
        }
    ],
)
print(response["message"]["content"])

# Comparing multiple images is also possible
response = ollama.chat(
    model="llava:13b",
    messages=[
        {
            "role": "user",
            "content": "Find the differences between these two images.",
            "images": ["./before.jpg", "./after.jpg"],
        }
    ],
)
print(response["message"]["content"])

Multimodal Use Cases

| Application | Description | Suitable Models |
| --- | --- | --- |
| Document OCR + Analysis | Read scanned documents and summarize content | OpenAI/Claude/Gemini Multimodal |
| Code Screenshot Debugging | Analyze error screens to identify root causes | OpenAI/Claude Multimodal |
| Chart/Graph Interpretation | Describe visualized data as text | Claude/Gemini Multimodal |
| Product Image Classification | Classify product photos by category | LLaVA (local, cost-saving) |
| UI/UX Review | Analyze app screenshots and suggest improvements | OpenAI/Claude/Gemini Multimodal |
| Medical Imaging Assistance | Initial analysis of X-ray, CT images | Specialized models required |

The key insight about multimodal models is that they don’t truly “understand” images — they convert images into token sequences to process them alongside text. An image encoder (usually ViT-based) converts the image into vectors, which are then combined with the language model’s input.
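As a concrete illustration of this conversion: the number of visual tokens is fixed by the encoder's patch grid, not by the image's semantic content. The sketch below uses the constants of CLIP ViT-L/14 at 336px resolution, the encoder behind LLaVA 1.5:

```python
def vit_token_count(image_size: int, patch_size: int = 14) -> int:
    """Number of visual tokens a ViT-style encoder produces for a square image.

    The image is split into non-overlapping patch_size x patch_size patches,
    and each patch becomes one embedding vector (one "token") that is fed
    into the language model alongside the text tokens.
    """
    patches_per_side = image_size // patch_size
    return patches_per_side ** 2

# CLIP ViT-L/14 at 336px: a 24x24 grid, i.e. 576 visual tokens per image
print(vit_token_count(336))       # 576
# The original ViT-B/16 at 224px produces a 14x14 grid
print(vit_token_count(224, 16))   # 196
```

This is why every image carries a fixed token cost regardless of how "simple" it looks, which is worth keeping in mind for Exercise 3 below.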

Today’s Exercises

  1. Use a commercial multimodal API (choose from OpenAI/Claude/Gemini) to analyze your own screenshot. Among text recognition, layout understanding, and semantic comprehension, which does it perform best at?
  2. Run LLaVA locally via Ollama and compare the response quality with a commercial multimodal API for the same image.
  3. Research the relationship between image resolution and token count in multimodal models. How much does the cost increase when sending high-resolution images?
